note: generally, this is to ensure that the HA changes for multiple servers have not introduced installation / upgrade regressions for single-server setups
start with existing (upgrade) or new (overwrite) db
use installer to add server: name it server-1, leave default affinity group option selected
test behavior of old agent trying to connect to new server
should fail and report protocol version mismatch error
register a new agent agent-1 to this server
tail the server log and make sure the cache gets loaded for this agent
using the HAAC agent view, verify that agent-1's failover list contains server-1 only
stop the agent
start the agent (without "--clean")
tail the server log and make sure the cache gets reloaded for this agent
kill the server
wait several minutes so the agent spools up data while the server is down
restart the server
tail the server log and make sure the cache gets reloaded for this agent
note: the cache must be loaded BEFORE any agent reports are sent
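optional: the server-log checks above can be scripted; the sketch below simply follows the log and prints matching lines, and both the log file path and the cache-load message pattern are assumptions to adjust to what your server actually logs.
```python
# Minimal log watcher: follows the server log and prints lines that look like
# the agent cache being loaded. The path and pattern below are assumptions.
import re
import time

LOG_FILE = "rhq-server-log4j.log"    # assumed log file -- point at the real server log
AGENT_NAME = "agent-1"
CACHE_PATTERN = re.compile(r"cache", re.IGNORECASE)  # assumed text of the cache-load message

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)            # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow(LOG_FILE):
    if CACHE_PATTERN.search(line) and AGENT_NAME in line:
        print(line.rstrip())
```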
put the server into maintenance mode (MM)
tail the agent log to ensure the agent is initiating failover
put the server into NORMAL mode
tail the server log and make sure the cache gets reloaded for this agent
note: the cache must be loaded BEFORE any agent reports are sent
register 5 more agents to the server (for a total of 6)
go to HA admin console, "affinity groups" section
ensure that servers and agents listed all have an empty affinity group value
using the HAAC agent view, verify that all 6 agents have a failover list containing server-1 only
using the HAAC server list view, verify that server-1 shows an agent count of 6 (assuming all agents are running)
log into the application as the superuser
go to HA admin console, "servers" section
verify that the correct number of servers is listed
are they exposed at the proper IPs/ports?
are they in the proper affinity groups?
does it properly show which agents are connected to which servers?
does it correctly show which servers are up/down?
also, repeat "verify HA console affinity group view"
go to HA admin console, "agents" section
ensure affinity group information matches expected values for all agents listed
verify agent resource configurations
go to the configuration > current subtab for each agent and verify that the affinity group configuration property is correct for that agent
verify the most recent configuration update history reflects the correct value for each agent
also, repeat "verify HA console affinity group view"
go to HA admin console, "affinity groups" section
click edit
use UI controls to move servers and/or agents into different affinity groups
click save
verify the results (there will be some redundancy, since the verify server/agent HA tracking data tasks both require verifying the data on the HA admin console, "affinity groups" section)
repeat "verify server HA tracking data"
repeat "verify agent HA tracking data"
log into HA admin console
click re-partition button
use other HA console UI pages to ensure that:
the agents are distributed correctly across all servers
If affinity is in use, the distribution may not be even. Satisfying affinity is weighted more highly than even distribution. See HA Load Balancing for more detail.
the failover list for each agent includes all servers (no server should be listed more than once)
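optional: if you transcribe the console data by hand, a small script can double-check both conditions; everything in the dictionaries below is assumed sample data to replace with the real snapshot.
```python
# Sanity-check a hand-transcribed snapshot of the HA console data.
from collections import Counter

connected = {            # agent name -> server it is currently connected to (assumed sample data)
    "agent-1": "server-1",
    "agent-2": "server-2",
}
failover_lists = {       # agent name -> failover list from the HAAC agent view (assumed sample data)
    "agent-1": ["server-1", "server-2"],
    "agent-2": ["server-2", "server-1"],
}
all_servers = {"server-1", "server-2"}

# distribution: how many agents each server carries (may be uneven when affinity is in use)
print(Counter(connected.values()))

# failover lists: every list contains every server exactly once
for agent, servers in failover_lists.items():
    assert set(servers) == all_servers, f"{agent} is missing {all_servers - set(servers)}"
    assert len(servers) == len(set(servers)), f"{agent} lists a server more than once"
print("failover lists OK")
```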
note: this performs an install using the same installer, but with different options. since you'll be configuring this server against the same existing database, this will be an HA install. it tests a different installer path, and also introduces server-side HA principles that need to be tested.
use installer, and point to the same database / port / dbuser you did earlier
name this instance server-2 (it should run on a different machine or, at minimum, on the same machine with a different port)
verify that the web UI can be reached from any server endpoint
repeat "ensure even agent load distribution"
Installing a server into the cloud will repartition the agents. Note that affinity will still be satisfied, so previous agent affinity should still hold, which may result in an uneven distribution.
Note that affinity assignment changes will also repartition the agents.
take down one of the servers, either via the shutdown operation (graceful) or by killing it
repeat "verify server HA tracking data"
repeat "ensure even agent load distribution"
A server going down, or going into maintenance mode, does not trigger a repartition. Server lists remain the same and agents will fail over using their existing lists.
log into HA admin console, "servers" section
take an enterprise-wide snapshot of which agents are connected to which servers
click "maintenance mode" button next to any server
repeat "ensure AG-aware agent load distribution", but do not click the re-partition button
note: a full redistribution is not performed at this time; only the agents connected to the server that went into maintenance mode should fail over to their secondaries. validate this by taking another enterprise-wide snapshot of which agents are connected to which servers and comparing it against the previous snapshot (see the snapshot-diff sketch after this block)
click the "normal" button for the same server again (ending the maintenance period)
repeat "ensure AG-aware agent load distribution", this time clicking the re-partition button as usual
Cloud member operation mode changes (going up or down, or in and out of maintenance mode) don't really affect the agent distribution algorithm. Server lists will include cloud members that may be temporarily unavailable. So, re-partitioning at this point should not have any impact on distribution.
Cloud size changes do affect the agent distribution. So, installation or deletion of servers will have a major effect on distribution.
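optional: a snapshot-diff sketch for the maintenance-mode check above; the two snapshots and the name of the server placed into maintenance are assumed sample data taken from the console.
```python
# Diff two enterprise-wide snapshots (agent -> connected server) taken before and
# after a server enters maintenance mode. Only agents that were connected to that
# server should have moved. All values below are assumed sample data.
before = {"agent-1": "server-1", "agent-2": "server-2", "agent-3": "server-1"}
after  = {"agent-1": "server-2", "agent-2": "server-2", "agent-3": "server-2"}
maintenance_server = "server-1"

moved = {a for a in before if before[a] != after.get(a)}
expected_to_move = {a for a, s in before.items() if s == maintenance_server}

assert moved == expected_to_move, f"unexpected failovers: moved={moved}, expected={expected_to_move}"
print("only agents from", maintenance_server, "failed over:", sorted(moved))
```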
use the HA admin console > servers section to put one server into maintenance mode
wait a few moments (agents will fail over to the remaining server in NORMAL mode)
use HA admin console > servers section to ensure all agents have connected to a single server
put the server back into NORMAL mode
after some time (should be no more than 1 hour) agents will switch back to their primary server.
if you want to speed up this test, you can reduce the 1-hour setting in the agent configuration (rhq.agent.primary-server-switchover-check-interval-msecs) to, say, 10 minutes.
ensure all agents are connected to their primary server
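optional: rather than re-checking by hand, a polling loop like the one below can watch for the switchover; get_connected_server() is a hypothetical placeholder for however you actually read the current agent-to-server connection (HA admin console, agent prompt, or agent log).
```python
# Poll until every agent is back on its primary server, or a timeout expires.
import time

primaries = {"agent-1": "server-1", "agent-2": "server-2"}   # assumed primary assignments

def get_connected_server(agent_name):
    # hypothetical placeholder -- replace with however you read the current connection
    raise NotImplementedError

deadline = time.time() + 70 * 60   # a bit more than the default 1-hour check interval
while time.time() < deadline:
    current = {a: get_connected_server(a) for a in primaries}
    if current == primaries:
        print("all agents are back on their primary servers")
        break
    time.sleep(60)
else:
    print("timed out waiting for agents to switch back to their primaries")
```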
temporarily block an agent from connecting to a server
e.g., via a firewall rule, a port forwarding change, or by unplugging one of the machines from the network (see the firewall sketch after this block)
if the blocked time is short (~15 secs), verify that the agent does not fail over
use HA admin console to verify that agent failover history is unchanged
if the blocked time is long (~1 min), verify that the agent fails over
use HA admin console to verify that agent failover history has new entries
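optional: one way to script the temporary block is a firewall rule on the agent machine; the sketch below assumes Linux with iptables, root privileges, and that the agent talks to the server on port 7080 (adjust the address, port, and duration to your setup).
```python
# Temporarily block the agent's outbound connection to the server, then restore it.
import subprocess
import time

SERVER_ADDR = "10.0.0.5"   # assumed server address -- use your server's IP
SERVER_PORT = "7080"       # assumed agent->server transport port -- adjust if different
BLOCK_SECONDS = 15         # ~15s for the "no failover" case, ~60s+ for the failover case

rule = ["-d", SERVER_ADDR, "-p", "tcp", "--dport", SERVER_PORT, "-j", "DROP"]
subprocess.run(["iptables", "-I", "OUTPUT"] + rule, check=True)   # add the DROP rule
try:
    time.sleep(BLOCK_SECONDS)
finally:
    subprocess.run(["iptables", "-D", "OUTPUT"] + rule, check=True)  # remove it again
```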
repeat "add server to the cloud", for a total of 3 servers in the cloud
log into HA admin console
click re-partition button
use other HA console UI pages to ensure that:
the agents are evenly load-distributed across all servers (should be 2 agents per server, if no affinity is assigned)
the failover list for each agent includes all servers (no server should be listed more than once)
note: affinity groups (AG) provide a mechanism for agents to prefer to connect to some servers over others.
log into HA admin console
assign 1 server to AG-1 and 2 agents to AG-1 (the rest of the agents/servers won't be in any AG)
make sure the agents you assign to AG-1 are currently NOT connected to the server put in AG-1
click re-partition button, and wait a while
You could wait quite a while for this (as much as a day, since agents do not "pull" a new server list very often). As an alternative you can:
Restart the agents
Use the new agent operation, via the GUI, to force (all of) the agents to update their lists (this is preferred as the agent keeps running).
ensure that the AG-1 agents are now connected to the AG-1 server
ensure the other 4 agents are evenly distributed across the remaining two non-AG servers
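optional: a scripted version of the two affinity checks above; the group assignments and connections below are assumed sample data to transcribe from the HA admin console after the repartition.
```python
# Check that AG-1 agents landed on the AG-1 server and that the remaining agents
# are spread across the non-AG servers. All data below is assumed sample data.
from collections import Counter

server_group = {"server-1": "AG-1", "server-2": None, "server-3": None}
agent_group  = {"agent-1": "AG-1", "agent-2": "AG-1",
                "agent-3": None, "agent-4": None, "agent-5": None, "agent-6": None}
connected    = {"agent-1": "server-1", "agent-2": "server-1",
                "agent-3": "server-2", "agent-4": "server-3",
                "agent-5": "server-2", "agent-6": "server-3"}

for agent, server in connected.items():
    if agent_group[agent] == "AG-1":
        assert server_group[server] == "AG-1", f"{agent} should be on an AG-1 server"
    else:
        assert server_group[server] is None, f"{agent} should be on a non-AG server"

# distribution of the non-AG agents across the non-AG servers
non_ag = Counter(s for a, s in connected.items() if agent_group[a] is None)
print("non-AG distribution:", dict(non_ag))
```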
repeat "add server to the cloud", for a total of 4 servers in the cloud
log into HA admin console
assign 2 more servers to AG-1 (now there are 3 servers in AG-1, and 1 server in no AG)
assign 2 more agents to AG-1 (now there are 4 agents in AG-1, and 2 agents in no AG)
click re-partition button, and wait a while
You could wait quite a while for this (as much as a day, since agents do not "pull" a new server list very often). As an alternative you can:
Restart the agents
Use the new agent operation, via the GUI, to force (all of) the agents to update their lists (this is preferred as the agent keeps running).
ensure that the 4 AG-1 agents are now connected to one of the 3 AG-1 servers
ensure the other 2 agents are connected to the remaining non-AG server
put one of the AG-1 servers into MM
ensure that the 4 AG-1 agents are now connected to one of the 2 remaining AG-1 servers
ensure the other 2 agents are still the only ones connected to the non-AG server